
Abstract 


Motivation: Genome-wide association studies (GWASs), which assay more than a million single nucleotide
polymorphisms (SNPs) in thousands of individuals, have been widely used to identify genetic risk
variants for complex diseases. However, most of the variants that have been identified contribute relatively
small increments of risk and explain only a small portion of the genetic variation in complex diseases. This
is the so-called missing heritability problem. Evidence has indicated that many complex diseases are
genetically related, meaning that these diseases share common genetic risk variants. Therefore, exploring the
genetic correlations across multiple related studies could be a promising strategy for removing spurious
associations and identifying underlying genetic risk variants, and thereby uncovering the mystery of missing
heritability in complex diseases.

Results: We present a general and robust method to identify genetic patterns from multiple large-scale
genomic datasets. We treat the summary statistics as a matrix and demonstrate that genetic patterns will
form a low-rank matrix plus a sparse component. Hence, we formulate the problem as a matrix recovery
problem, where we aim to discover risk variants shared by multiple diseases/traits and those specific to each
individual disease/trait. We propose a convex formulation for matrix recovery and an efficient algorithm
to solve the problem. We demonstrate the advantages of our method using both synthesized datasets and
real datasets. The experimental results show that our method can successfully reconstruct both the shared
and the individual genetic patterns from summary statistics and achieves better performance than
alternative methods under a wide range of scenarios.

Availability: The MATLAB code is available at: http://www.comp.hkbu.edu.hk/~xwan/low_rank.zip.


1 Introduction 

Many common human diseases, such as type-1 and type-2 diabetes, depression, schizophrenia, and prostate
cancer, are influenced by several genetic and environmental factors. Scientists and public health officials have
great interest in finding genetic patterns associated with complex diseases, not only to advance our understanding
of multi-gene disorders, but also to provide more insights into complex diseases. Disease association studies
have provided substantial evidence that complex diseases originate in disorders of multiple
genes [1,2]. Nevertheless, until recently the full-coverage identification of the genetic variants contributing to
complex diseases has been dauntingly difficult.

After the completion of the Human Genome Project [3, 4] and the initiation of the International HapMap 
Project [5], interest has focused on genome-wide association studies (GWASs), in which the goal is to identify 
single-nucleotide polymorphisms (SNPs) that are associated with complex diseases (such as diabetes) or traits 
(such as human height). As of Dec. 2014, more than 15,000 SNPs have been reported to be associated with at
least one disease/trait at the genome-wide significance level (P-value $< 5 \times 10^{-8}$) [6]. However, most of the
findings only explain a small portion of the genetic contributions to complex diseases. For example, all of the
18 SNPs identified in type 2 diabetes (T2D) only account for about 6% of the inherited risk [7]. There is still
a large portion of disease/trait heritability that remains unexplained. This is the so-called missing heritability
problem [7, 8], which is often used to denote the gap between the expected heritability of many common
diseases, as estimated by family and twin studies, and the overall additive heritability obtained by accumulating
the effects of all of the SNPs that have been found to be significantly associated with these conditions.

A recent study [9] has suggested that most of the heritability is not missing but can be explained by the
effects of many genetic variants, with each variant probably contributing a weak effect. However, finding
variants with small effects is computationally very challenging because the traditional single-locus based test
cannot identify such variants and the number of groups of multiple variants to be investigated in GWAS is
astronomical. In addition, in the high-dimensional and low-sample size settings of GWAS, many irrelevant
variants tend to have high sample correlations due to randomness, which makes GWAS prone to false scientific
discoveries. To solve the missing heritability problem, a large sample size is required, but such a requirement
is usually beyond the capacity of a single GWAS, as sample recruitment is expensive and time consuming.

Evidence has indicated that many complex diseases are genetically related [10, 11, 12, 13], meaning that
these diseases share common genetic risk variants. This suggests that an integrative analysis of related genomic
data could be a promising strategy for removing spurious associations and identifying risk genetic variants
with small effects, and thus finding the missing heritability of complex diseases. As high-throughput data
acquisition becomes popular in biomedical research, new computational methods for large-scale data analysis
become more and more important.

When analyzing genomic data from multiple related studies, the ideal scenario is for the individual-level
data to be available for all of the included studies, but this may be difficult to achieve due to restrictions on
sharing individual-level data. In fact, summary data (mostly P-values or z-scores) are more frequently shared.
To identify significant SNPs shared by all of the included studies, the commonly used statistical approach is
to combine P-values using Fisher's method [14]. [15] generalized Fisher's method to include weights when
combining P-values. [16] suggested using the inverse normal transformation and Mosteller and [17] further
generalized Stouffer's method by including weights when combining z-scores. There are two issues in such
traditional statistical approaches. First, one small P-value can overwhelm many large P-values and dominate
the test statistic. In high-dimensional and low-sample size settings, many irrelevant variants tend to have
high significance due to randomness, which may cause wrong statistical inferences. Second, the information
about genetic correlations between SNPs in the original data is completely lost after combining P-values.
This information is necessary for understanding the genetic architecture of complex diseases because common
complex diseases are associated with multiple genetic variants.
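
The first issue can be illustrated with a small sketch. The code below implements Fisher's method using only the Python standard library and combines ten P-values, nine of which are unremarkable. The function name `fisher_combined_pvalue` and the example inputs are assumptions made for illustration and are not taken from the cited works; independence of the P-values is assumed.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent P-values with Fisher's method.

    Under the null, -2 * sum(log p_i) follows a chi-square distribution
    with 2k degrees of freedom. For even degrees of freedom the survival
    function has the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # test statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i  # (x/2)^i / i!, built incrementally
        total += term
    return math.exp(-half_stat) * total


# Hypothetical inputs: nine null-like P-values and one very small one.
mostly_null = [0.5] * 9 + [1e-8]
all_null = [0.5] * 10

combined_with_outlier = fisher_combined_pvalue(mostly_null)
combined_all_null = fisher_combined_pvalue(all_null)
print(combined_with_outlier, combined_all_null)
```

In this toy input a single P-value of 1e-8 is enough to pull the combined P-value below 0.001, although the other nine studies carry no signal.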

To identify shared genetic structures across multiple related studies, one feasible approach is to conduct a 
biclustering analysis on a matrix of summary statistics, in which the rows represent studies and the columns 
represent genetic variants, to simultaneously group studies and genetic variants. Many biclustering methods 
have been proposed and some comprehensive reviews of biclustering methods can be found in [18], [19], and 
[20]. However, the traditional biclustering methods do not perform well on genomic data because genomic data
is high dimensional and most of its genetic variants are irrelevant. To obtain sparse and interpretable biclusters,
a novel statistical approach, sparseBC, was recently proposed, which applies an $\ell_1$ penalty to the means of the
biclusters [21]. A big drawback of sparseBC is that it does not allow for overlapping biclusters, which limits
its application in genomic data analysis because the shared genetic patterns in GWASs may be very complex.
Furthermore, in genomic data, besides the shared genetic structure, each disease/trait has some distinct genetic
variants. The typical biclustering model may treat them as noise and discard them.

In this paper, we introduce a new method to identify genetic patterns in high dimensional genomic data. 
Our method possesses several advantages over existing works. First, our method admits a single model to detect 
both shared and individual genetic patterns among multiple studies. Second, our method employs two tuning 
parameters that control the size of the shared genetic pattern and the number of individual signals. The choices
of these parameters have solid theoretical support. Third, our method produces the unique global minimizer
of a convex problem, which means that the solution is always stable.

To demonstrate the performance of our proposed method, we conduct comparison experiments using both 
synthesized datasets and real datasets. Simulation results show that the proposed method outperforms existing 
methods in many settings. A large dataset containing 32 GWASs is also analyzed to demonstrate the advantage 
of our method. Specifically, we propose the convex formulation, the algorithm, and the parameter selection in 
Section 2. Simulation studies and real data analysis are presented in Section 3. We conclude the paper with 
some discussions in Section 4. 

2 Methods 

2.1 Formulation 

Mathematically, the summary statistics from multiple related studies can be expressed as a matrix
$D \in \mathbb{R}^{n \times p}$, where each entry $d_{ij}$ is a z-score (if only P-values are available, we can transform them
into z-scores), and $n$ and $p$ are the numbers of studies and SNPs, respectively. Our goal is to (1) detect shared genetic patterns
across studies, which can be represented as sparse biclusters in this matrix and (2) detect individual genetic 
variants for each study, which we assume are randomly distributed and sparse. Since the sparsity of biclusters 
in a matrix indicates a low-rank property (please see examples in simulation studies), the problem of identifying 
these two types of genetic patterns can be treated as a problem of recovering a low-rank component X and a 
sparse component E from the input data D. Our proposed approach is based on the assumed sparsity of genetic 
patterns because in large-scale genomic data, most genetic variants are irrelevant. 
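
The text does not state which transformation is used to turn P-values into z-scores. A minimal sketch, assuming two-sided P-values and the standard normal quantile function (so the result is an unsigned z-score), could look as follows; the function name `pvalue_to_zscore` is hypothetical.

```python
from statistics import NormalDist


def pvalue_to_zscore(p, two_sided=True):
    """Map a P-value in (0, 1] to a non-negative z-score.

    ASSUMPTION: the P-value comes from a standard normal test statistic.
    The lower tail is used (and negated) so that very small P-values do
    not round 1 - p up to exactly 1.0. One-sided P-values must be < 1.
    """
    tail = p / 2.0 if two_sided else p
    return -NormalDist().inv_cdf(tail)


# Hypothetical P-values for one SNP across four studies.
z_scores = [pvalue_to_zscore(p) for p in (0.05, 1e-8, 0.5, 1.0)]
print(z_scores)
```

If effect directions are available, a signed z-score can be obtained by multiplying by the sign of the estimated effect; that step is left out here because the source does not describe it.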

We propose to use the following decomposition model to detect genetic patterns from noisy input: 

$$D = X + E + \epsilon, \qquad (1)$$

where $X$ is a low-rank component, $E$ is a sparse component, and $\epsilon$ is a noise component. In GWAS data
analysis, the low-rank component corresponds to the causal SNPs that are shared by several diseases/traits. The
sparse component corresponds to the causal SNPs that affect one specific disease/trait. The noise component
corresponds to the measurement error, which is often modeled by an i.i.d. Gaussian distribution with zero mean.
Naturally, to achieve the decomposition, the following minimization problem is considered:

$$\min_{X, E, \epsilon} \; \frac{1}{2}\|\epsilon\|_F^2 + \alpha\,\mathrm{rank}(X) + \beta\|E\|_0 \quad \text{s.t.} \quad D = X + E + \epsilon, \qquad (2)$$

where $\|\epsilon\|_F = \sqrt{\sum_{ij} \epsilon_{ij}^2}$ is the Frobenius norm and $\|E\|_0$ is the $\ell_0$-norm that counts the number of nonzero
values in $E$. The solution to Eq.(2) will give a penalized maximum likelihood estimate with respect to the
variables $X$, $E$, and $\epsilon$.

However, the proposed model in Eq.(2) is intractable and NP-hard. Thus, in order to effectively recover $X$
and $E$, we use convex relaxation to replace $\mathrm{rank}(\cdot)$ by the nuclear norm and the $\ell_0$-norm by the $\ell_1$-norm.
Here, the nuclear norm is defined as $\|X\|_* = \sum_{i=1}^{r} \sigma_i$, where $\sigma_1, \dots, \sigma_r$ are the singular values of $X$. It is
the tightest convex surrogate to the rank operator [22] and has been widely used for low-rank matrix recovery
[23]. The $\ell_1$-norm is defined as $\|X\|_1 = \sum_{ij} |X_{ij}|$. The $\ell_1$ relaxation has proven to be a powerful technique
for sparse signal recovery [24].

Finally, instead of directly solving Eq.(2), we solve the following problem,

$$\min_{X, E} \; \mathcal{J}(X, E) = \frac{1}{2}\|D - X - E\|_F^2 + \alpha\|X\|_* + \beta\|E\|_1. \qquad (3)$$

It is easy to prove that Eq.(3) is a convex problem and, therefore, the global optimal solution is unique. We
will introduce the algorithm to solve this optimization problem in the next subsection.

2.2 Algorithm 

The optimization problem of Eq.(3) can be solved by alternately solving the following two sub-problems
until convergence:

$$X \leftarrow \arg\min_{X} \mathcal{J}(X, E), \qquad (4)$$

$$E \leftarrow \arg\min_{E} \mathcal{J}(X, E). \qquad (5)$$

The theoretical proof for the convergence can be found in [25].

The problem in Eq.(4) can be reduced to

$$\min_{X} \; \frac{1}{2}\|D - E - X\|_F^2 + \alpha\|X\|_*, \qquad (6)$$

which is a nuclear-norm regularized least-squares problem and has the following closed-form solution
[26],

$$X = \mathcal{D}_\alpha(D - E), \qquad (7)$$

where $\mathcal{D}_\lambda$ refers to the singular value thresholding (SVT) operator

$$\mathcal{D}_\lambda(M) = \sum_{i=1}^{r} (\sigma_i - \lambda)_+ u_i v_i^T. \qquad (8)$$

Here, $(x)_+ = \max(x, 0)$. $\{u_i\}$, $\{v_i\}$, and $\{\sigma_i\}$ are the left singular vectors, the right singular vectors, and the
singular values of $M$, respectively.
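
As a rough illustration of the SVT operator in Eq.(8), the following sketch uses NumPy. The function name `singular_value_threshold` is hypothetical and the code is not taken from the authors' MATLAB implementation.

```python
import numpy as np


def singular_value_threshold(M, lam):
    """Shrink every singular value of M by lam and clip at zero (Eq. 8)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)
    # Scale the columns of U by the shrunk singular values, then recombine.
    return (U * s_shrunk) @ Vt


# Toy example: singular values 3 and 1 shrunk by 2 become 1 and 0.
M = np.diag([3.0, 1.0])
print(singular_value_threshold(M, 2.0))
```

Singular values below the threshold are set to zero, which is how the X-update lowers the rank of the estimate.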

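The problem in Eq.(5) can be rewritten as

$$\min_{E} \; \frac{1}{2}\|D - X - E\|_F^2 + \beta\|E\|_1. \qquad (9)$$

It admits a closed-form solution

$$E = \mathcal{S}_\beta(D - X), \qquad (10)$$

where $\mathcal{S}_\beta(M)_{ij} = \mathrm{sign}(M_{ij})(|M_{ij}| - \beta)_+$ refers to the elementwise soft-thresholding operator [25].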
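
The elementwise soft-thresholding operator of Eq.(10) can be sketched in a few lines of NumPy. The function name `soft_threshold` is an assumption made for this example.

```python
import numpy as np


def soft_threshold(M, beta):
    """Move every entry of M towards zero by beta; entries within beta of zero become 0."""
    return np.sign(M) * np.maximum(np.abs(M) - beta, 0.0)


values = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(soft_threshold(values, 1.0))
```

Entries whose magnitude is at most $\beta$ are set exactly to zero, which is what makes the recovered $E$ sparse.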

Overall, the algorithm to optimize the proposed model in Eq.(3) is summarized in Algorithm 1. It will give 
a global optimal solution independent of initialization. 

Algorithm 1 The algorithm to solve Eq.(3). 

1. Input: D 

2. Initialize all variables to be zero. 

3. repeat 

4. Update X by solving Eq.(6) via singular value thresholding. 

5. Update E by solving Eq.(9) via soft thresholding. 

6. until convergence

7. Output: X and E 
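
Algorithm 1 can be sketched as follows in NumPy. This is a minimal illustration and not the authors' MATLAB code: the function name, the relative-change stopping rule, and the defaults `tol` and `max_iter` are assumptions, since the text only says "until convergence".

```python
import numpy as np


def low_rank_plus_sparse(D, alpha, beta, tol=1e-6, max_iter=500):
    """Alternate the closed-form updates of Eq.(7) and Eq.(10).

    ASSUMPTION: iteration stops when the combined change in X and E,
    relative to the Frobenius norm of D, drops below tol.
    """
    D = np.asarray(D, dtype=float)
    X = np.zeros_like(D)
    E = np.zeros_like(D)
    scale = max(np.linalg.norm(D), 1e-12)
    for _ in range(max_iter):
        X_prev, E_prev = X, E
        # Step 4: X-update by singular value thresholding of D - E.
        U, s, Vt = np.linalg.svd(D - E, full_matrices=False)
        X = (U * np.maximum(s - alpha, 0.0)) @ Vt
        # Step 5: E-update by elementwise soft thresholding of D - X.
        R = D - X
        E = np.sign(R) * np.maximum(np.abs(R) - beta, 0.0)
        change = np.linalg.norm(X - X_prev) + np.linalg.norm(E - E_prev)
        if change / scale < tol:
            break
    return X, E


# Toy input: a constant (rank-1) block, small noise, and one isolated spike.
rng = np.random.default_rng(0)
D_toy = 2.0 * np.ones((30, 20)) + 0.1 * rng.standard_normal((30, 20))
D_toy[5, 7] += 25.0
X_hat, E_hat = low_rank_plus_sparse(D_toy, alpha=1.2, beta=0.3)
print(np.count_nonzero(E_hat), E_hat[5, 7])
```

In this toy input the isolated spike is expected to end up in the sparse component, while the constant block stays in the low-rank component.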


2.3 Parameter selection 

There are two parameters in our model, which can be estimated properly via the analysis of the size of the
input matrix $(n, p)$ and the standard deviation of the noise $\sigma$ [23, 27].

The relative weight $\lambda = \beta/\alpha$ balances the two terms in $\alpha\|X\|_* + \beta\|E\|_1$ and consequently controls the
rank of $X$ and the sparsity of $E$. [23] has proved that $\lambda = 1/\sqrt{m}$ gives a large probability of recovering $X$ and
$E$ under their assumed conditions and stated that this value can be adjusted slightly to obtain the best results
in specific applications. Here, $m$ is the larger dimension of the input matrix. In our problem, $m = p$, i.e. the
number of SNPs. However, on real datasets, the shared SNPs rarely form a perfectly low-rank matrix, and we
use $\beta = 2\sigma$ to keep sufficient variation in $X$.

The parameter $\alpha$ serves as a threshold in the SVT step in Eq.(8). It should be large enough to threshold out
the noise but not so large that it over-shrinks the signal [27]. A proper value is $\alpha = (\sqrt{n} + \sqrt{p})\sigma$, which is the
expected $\ell_2$-norm of an $n \times p$ random matrix with entries sampled from $\mathcal{N}(0, \sigma^2)$. As SNPs are sparse in the
data, we can estimate $\sigma$ from the data by the median-absolute-deviation estimator [28]

$$\sigma = 1.48\,\mathrm{median}\{|D - \mathrm{median}(D)|\}. \qquad (11)$$
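
The parameter choices described in this subsection can be sketched as below. The function name `select_parameters` is hypothetical. The sketch uses the default relative weight $\lambda = 1/\sqrt{m}$ from [23]; the adjusted $\beta$ used for real data is not reproduced here.

```python
import numpy as np


def select_parameters(D):
    """Return (sigma, alpha, beta) for an n x p matrix of z-scores."""
    n, p = D.shape
    # Eq.(11): median-absolute-deviation estimate of the noise level.
    sigma = 1.48 * np.median(np.abs(D - np.median(D)))
    # SVT threshold: (sqrt(n) + sqrt(p)) * sigma.
    alpha = (np.sqrt(n) + np.sqrt(p)) * sigma
    # Default relative weight lambda = 1 / sqrt(m), with m the larger dimension.
    lam = 1.0 / np.sqrt(max(n, p))
    beta = lam * alpha
    return sigma, alpha, beta


# Pure-noise example with known standard deviation 2.
rng = np.random.default_rng(1)
noise = rng.normal(0.0, 2.0, size=(200, 300))
sigma_hat, alpha_hat, beta_hat = select_parameters(noise)
print(sigma_hat, alpha_hat, beta_hat)
```

On pure Gaussian noise the estimate should be close to the true standard deviation; on real summary statistics it relies on the assumption stated above that true signals are sparse.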

3 Results

3.1 Simulation studies 

We first compare the performance of our method with three existing biclustering methods in four simulation
studies: sparseBC (sparse biclustering) [21], LAS [29], and SSVD [30]. Since biclustering methods
search for sample-variable associations in the form of distinguished submatrices of the data matrix, we consider
an entry $(i, j)$ that belongs to one of the resulting biclusters and meets a predefined criterion as a reported
association. Specifically, for the sparse biclustering method, we use the parameters mentioned in [21],
and the entries in the clusters that satisfy a preselected cutoff are recognized as the final result. For LAS, we
use the default settings. For SSVD, which uses a variant of singular value decomposition to find biclusters, we
try different parameter settings and report the best one as its result. LAS and SSVD can detect overlapping
biclusters, but sometimes they report the entire matrix as one bicluster. Thus, for both LAS and SSVD, the
biclusters that contain the entire matrix are discarded. For our method, the parameters are selected as stated in
Section 2. Then we use a threshold $T$ to determine whether an entry $(i, j)$ of the matrix is reported as a result
by comparing the values of $X(i, j)$ and $E(i, j)$ with $T$.

We evaluate each method in terms of the F1-score, which is calculated as follows:

$$\mathrm{precision} = \frac{tp}{tp + fp}, \qquad (12)$$

$$\mathrm{recall} = \frac{tp}{tp + fn}, \qquad (13)$$

$$\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad (14)$$

where $tp$ and $fp$ denote the number of true positives and false positives, respectively, and $fn$ denotes the
number of false negatives.
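
A small sketch of Eqs.(12) to (14) on sets of reported matrix entries is given below. The function name `precision_recall_f1` and the convention of returning 0 when a denominator is zero are assumptions of this example.

```python
def precision_recall_f1(predicted, truth):
    """Compute precision, recall and F1 from sets of (i, j) entries."""
    tp = len(predicted & truth)   # reported and truly a signal
    fp = len(predicted - truth)   # reported but not a signal
    fn = len(truth - predicted)   # true signal that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical reported entries versus ground-truth entries.
reported = {(0, 0), (0, 1), (1, 1)}
ground_truth = {(0, 0), (1, 1), (2, 2), (3, 3)}
print(precision_recall_f1(reported, ground_truth))
```

Here two of the three reported entries are correct and two of the four true entries are found, so precision is 2/3, recall is 1/2, and the F1-score is 4/7.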


3.1.1 Simulation settings 

We adopt four patterns (each in one simulation study), illustrated in Figure 1, to generate synthetic data.

• Pattern 1 adopts a case from [30], which generated a rank-1 true signal matrix. Let $M = d u_1 v_1^T$
be a $100 \times 50$ matrix with $d = 50$, $\tilde{u}_1 = [10, 9, 8, 7, 6, 5, 4, 3, r(2, 17), r(0, 75)]^T$,
$\tilde{v}_1 = [10, -10, 8, -8, 5, -5, r(-3, 5), r(0, 34)]^T$, $u_1 = \tilde{u}_1/\|\tilde{u}_1\|_2$, and $v_1 = \tilde{v}_1/\|\tilde{v}_1\|_2$, where $r(a, b)$ denotes a vector of
length $b$ with all entries equal to $a$. This case simulates the shared causal SNPs among several studies.

• Pattern 2 extends Pattern 1 by adding some sparse signals. That is, we generate a sparse component $E$,
whose entries are independently distributed, each taking on value 0 with probability $1 - p_s$, and value 6
with probability $p_s = 0.01$.

• Pattern 3 adopts the case from [21], which generated two overlapping biclusters. Let
$M = d(u_1 v_1^T + u_2 v_2^T)$ be a $100 \times 50$ matrix with $d = 50$, $u_1$ and $v_1$ as defined in Pattern 1,
$\tilde{u}_2 = [r(0, 13), 10, 9, 8, 7, 6, 5, 4, 3, r(2, 17), r(0, 62)]^T$,
$\tilde{v}_2 = [r(0, 9), 10, -9, 8, -7, 6, -5, r(4, 5), r(-3, 5), r(0, 25)]^T$, $u_2 = \tilde{u}_2/\|\tilde{u}_2\|_2$,
and $v_2 = \tilde{v}_2/\|\tilde{v}_2\|_2$.

• Pattern 4 extends Pattern 3 by adding some sparse signals in the same way as Pattern 2.

Figure 1: Four scenarios in our simulation study. Pattern 1 contains a rank-1 component representing one
bicluster. Pattern 2 adds some sparse signals to Pattern 1. Pattern 3 contains a rank-2 component representing
two overlapped biclusters. Pattern 4 contains sparse signals in addition to overlapped biclusters.

3.1.2 Data generation 

Given a specific pattern mentioned above, we first generate the data matrix. To simulate the real situation,
we randomly shuffle the rows and the columns. Next, we add Gaussian noise $\epsilon \sim \mathcal{N}(0, 1)$ to each entry. Figure
2 illustrates the ground-truth data and the generated data. For each generated data matrix, we also compute the
signal-to-noise ratio (SNR). To illustrate how the methods perform for data with different SNRs, we further
scale down the ground-truth signal by dividing the original values by 1.2 and 1.5, respectively.
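
The generation procedure can be sketched as follows for Pattern 1, with Pattern 2 obtained by switching on the sparse component. All function and parameter names are hypothetical. Two assumptions are made: the vector $\tilde{v}_1$ as printed has 45 entries, so the sketch inserts a block $r(3, 5)$ to reach length 50; and the scaling factor is applied to the whole signal, including the sparse part, which the text does not specify. The SNR computation is omitted because its definition is not given.

```python
import numpy as np


def r(a, b):
    """Vector of length b with all entries equal to a."""
    return [a] * b


def pattern1_signal(d=50.0):
    """Rank-1 signal M = d * u1 * v1^T of size 100 x 50."""
    u_tilde = np.array([10, 9, 8, 7, 6, 5, 4, 3] + r(2, 17) + r(0, 75), dtype=float)
    # ASSUMPTION: r(3, 5) is added so that v_tilde has 50 entries.
    v_tilde = np.array([10, -10, 8, -8, 5, -5] + r(3, 5) + r(-3, 5) + r(0, 34),
                       dtype=float)
    u = u_tilde / np.linalg.norm(u_tilde)
    v = v_tilde / np.linalg.norm(v_tilde)
    return d * np.outer(u, v)


def simulate(M, rng, sparse_prob=0.0, sparse_value=6.0, scale=1.0):
    """Return (noisy data, ground-truth signal) after shuffling rows and columns."""
    signal = M.copy()
    if sparse_prob > 0:
        # Sparse component: value 6 with probability p_s, otherwise 0.
        signal = signal + sparse_value * (rng.random(M.shape) < sparse_prob)
    signal = signal / scale  # ASSUMPTION: scaling applies to the whole signal
    row_perm = rng.permutation(M.shape[0])
    col_perm = rng.permutation(M.shape[1])
    signal = signal[row_perm][:, col_perm]
    noisy = signal + rng.standard_normal(M.shape)  # noise ~ N(0, 1)
    return noisy, signal


M1 = pattern1_signal()
noisy1, truth1 = simulate(M1, np.random.default_rng(0))
noisy2, truth2 = simulate(M1, np.random.default_rng(0), sparse_prob=0.01)
print(noisy1.shape, np.linalg.matrix_rank(M1))
```

Passing `scale=1.2` or `scale=1.5` gives the lower-SNR variants described above.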

3.1.3 Simulation results 

The results of the four simulation studies are shown in Figure 3. We use 'Low-rank' to represent our method, as
our model finds biclusters via a low-rank approximation. The details of the simulation results can be found
in the supplementary materials. In general, our proposed method achieves comparable performance in the first
and third simulation studies and performs better than the other three methods in the second and fourth simulation
studies. This is because the classical biclustering methods suffer from several limitations, such as missing
some entries for overlapped biclusters and the inability to identify the disease/trait-specific entries. Figure 4
shows one result for the fourth pattern. Our proposed method can successfully recover a low-rank component
and a sparse component from raw data. In the first simulation, the F1-scores of the sparse biclustering method
and the SSVD method almost reach 1. The reason why our method performs worse is that we use the default
parameters, which are not the best fit for this simulation set-up. When the parameters are adjusted, our method can
also achieve a high F1-score. Furthermore, we can observe from Figure 3 that our method always performs equally
well in terms of both precision and recall, while the other three methods often favor precision over recall. In
large-scale data analysis, a conservative method with high precision and low recall may not be suitable
for new discoveries because most signals are irrelevant. For such situations, our method has a clear advantage
over competitors.

Figure 2: Illustrations of the four simulations: Simulation 1 (Pattern 1, SNR = 2.5), Simulation 2 (Pattern 2,
SNR = 3.3), Simulation 3 (Pattern 3, SNR = 2.6), and Simulation 4 (Pattern 4, SNR = 2.9). For each simulation,
the generated matrix with noise is shown in the left panel and the ground-truth matrix is shown in the right
panel. In the ground-truth matrix, the red entries indicate the true signals.

Figure 3: Comparison results of different methods in four simulation studies, each using one pre-defined pattern.

Figure 4: An illustration of the simulation result. The low-rank component and the sparse component are
recovered by our method.

3.2 Real application

We applied our method to analyze 32 independent diseases/traits, including

• 3 anthropometric related data.

• 9 psychiatry related data.

• 8 CAD data.

• 2 social science studies.

• 2 glycaemic traits.

• 6 inflammatory bowel disease data.

• systemic lupus erythematosus.

• Parkinson's disease.

The details of the data sets, including the references and the web links for downloading the data, can be found
in the supplementary materials. Since each study reports different SNPs, we take the SNPs that are reported
in at least 28 diseases/traits, obtain their P-values, and impute the missing ones. Finally, we get a P-value
matrix $P \in \mathbb{R}^{466423 \times 32}$ for the 32 diseases/traits. Next, we convert the P-value matrix to the z-score matrix
$Z \in \mathbb{R}^{466423 \times 32}$ and analyze this data set using our method on a desktop PC with a 2.40GHz CPU and 4GB
RAM. The running time of our method on the 32 GWAS data sets is only 152.1s. The three alternative methods
investigated in this work cannot be applied due to the large size of the data.

The experimental results are given in Figure 5. The shared causal SNPs are presented in the low-rank
component and individual-specific SNPs are shown in the sparse component. We take the first three right singular
vectors of the recovered low-rank matrix and use them as the coordinates of each study in Figure 6. From Figure
6, it is clear that 3 clusters are recovered from the 32 diseases/traits:

• 2 social science studies (edu_years and college);

• diastolic blood pressure and systolic blood pressure (DBP and SBP);

• total cholesterol and low density lipoprotein (TC and LDL).
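
The coordinates used for Figure 6 can be sketched as follows. The function name `study_coordinates` is hypothetical, and the sketch assumes that the recovered low-rank matrix is stored with SNPs as rows and studies as columns, as in the $466423 \times 32$ z-score matrix above.

```python
import numpy as np


def study_coordinates(X_low_rank, k=3):
    """Return one k-dimensional coordinate per study (column of X_low_rank)."""
    _, _, Vt = np.linalg.svd(X_low_rank, full_matrices=False)
    # Rows of Vt are right singular vectors; entry j of each belongs to study j.
    return Vt[:k].T


# Toy matrix with 50 SNPs and 6 studies; studies 0 and 1 are identical.
rng = np.random.default_rng(0)
X_toy = rng.standard_normal((50, 6))
X_toy[:, 1] = X_toy[:, 0]
coords = study_coordinates(X_toy)
print(coords.shape, np.allclose(coords[0], coords[1]))
```

Studies with identical columns receive identical coordinates, which is why closely related traits appear as tight clusters in such a plot.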

The diseases/traits in each cluster are highly relevant to each other. We compare the causal
SNPs identified by our method on the 32 GWAS data sets with some previous findings. For the 3 pairs of diseases/traits that are
clustered together, we mainly investigate the shared SNPs that are identified by our method. For the two social
science related data sets, our method has detected SNP rs3789044, SNP rs12046747, and SNP rs12853561, which
are mapped to genes LRRN2 and STK24, respectively. These were reported in the original article [31] because
they have significant P-values (the details are provided in the supplementary materials). However, besides
those SNPs with significant P-values, our method has also identified some loci with moderate P-values. SNP
rs2532269, whose original P-values are 1.01 x 10“^ in the edu_years data and 1.11 x 10“^ in the college data, is
detected as a causal SNP by our method. This SNP was previously reported (P-value = 2 x 10”^^) [32] and
mapped to the gene KIAA1267. This gene is highly connected with Koolen-De Vries syndrome. Koolen-De
Vries syndrome is characterized by moderate to severe intellectual disability, hypotonia, friendly demeanor,
and highly distinctive facial features, including tall, broad forehead, long face, upslanting palpebral fissures,
epicanthal folds, tubular nose with bulbous nasal tip, and large ears [33].

Figure 5: The experiment results on 32 GWASs. The low-rank component (middle panel) and the sparse
component (right panel) are recovered by our method.

For diastolic blood pressure and systolic blood pressure, the SNPs identified in our experiment are also
connected with some previously published genes, such as ULK4, FGF5 and C10orf107 [34]. Similarly, some
additional loci are identified by the low-rank component. SNP rs4986172 (original P-values in the SBP data and
DBP data are 3.09 x 10“® and 0.0172, respectively), located in the gene ACBD4, is detected by the low-rank
component. This gene has been associated with high blood pressure in [35].

To illustrate the power of our method in identifying the causal SNPs that are not shared by several
diseases/traits, we take the result of bipolar disorder as an example. The SNPs in the result of bipolar disorder
can be matched to ANK3, CACNA1C, SYNE1 and PBRM1, which have been confirmed to be associated with
bipolar disorder [36]. The detailed results of other diseases/traits can be found in the supplementary materials.
Clearly, the experimental results show that our method can not only recognize SNPs with small P-values, but
also detect those SNPs with moderate or weak P-values.

Figure 6: The geometric relationships of all studies using the coordinates derived from the first three right
singular vectors of the recovered low-rank matrix.

4 Discussion 

Finding weak-effect variants to explain the missing heritability of complex diseases is a challenging task
and is bottlenecked by the available sample size of GWAS. Based on the fact that related diseases/traits tend to
co-occur, discovering shared genetic components among related studies has become a popular way to address this
issue. In the last few years, hundreds of GWASs have been carried out. Therefore, it is timely to systematically
investigate GWAS data sets to find those shared patterns for a comprehensive understanding of the genetic
architecture of complex diseases/traits. In this work, we present a novel method for exploring the genetic patterns of
complex diseases. We assume that causal SNPs can be divided into two categories: SNPs shared by multiple
diseases/traits and SNPs for an individual disease/trait. Thus, by modeling the problem as recovering a low-rank
component and a sparse component from a noisy matrix, we formulate it as a convex optimization problem. To
demonstrate the performance of our proposed method, we conduct several simulation studies under different
settings. Simulation results show that the proposed method outperforms three alternative methods in many
settings. In the real data studies, we collect 32 large-scale GWAS data sets. We have successfully analyzed these
data sets via our proposed method and discovered some interesting shared genetic patterns. Many identified
variants have been confirmed by other works. To conclude, our proposed method not only possesses better
power than related methods but also provides easily interpretable results for better understanding the shared genetic
architectures of complex diseases/traits.

In this work, we mainly focus on the analysis of summary statistics. With the development of new
technology, more and more supplementary information, such as functional annotation data, structural data, and
biochemical data, can be quickly obtained. In future work, we will integrate this information into our
method to increase the statistical power.

Data descriptions

We applied our method to analyze 32 independent diseases/traits, including

• 3 anthropometric related data: body mass index [37], height [38], waist-hip ratio adjusted for BMI [39].

• 9 psychiatry related data: five PGC data [13] (attention-deficit/hyperactivity disorder, autism spectrum
disorder, bipolar disorder, major depressive disorder, schizophrenia) and four TAG data [40] (TagCPD,
TagEVRSMK, TagFORMER, TagLOGONSET).

• 8 CAD data: total cholesterol [41], low density lipoprotein [41], triglycerides [41], high density
lipoprotein [41], type 2 diabetes [2], coronary artery disease [42], diastolic blood pressure [34], systolic blood
pressure [34].

• 2 social science related data [31]: edu_years, college.

• 2 glycaemic traits [43]: fasting glucose, fasting insulin.

• 6 inflammatory bowel disease data: Crohn's disease [44], multiple sclerosis [45], psoriasis [46],
rheumatoid arthritis [47], type 1 diabetes [48], ulcerative colitis [49].

• systemic lupus erythematosus [50].

• Parkinson's disease [51].

Table 1: 32 GWASs data.

| Name | # of SNPs | Link |
|---|---|---|
| body mass index [37] | 2471516 | http://www.broadinstitute.org/collaboration/giant/index.php/ |
| height [38] | 2469635 | http://www.broadinstitute.org/collaboration/giant/index.php/ |
| Crohn's disease [44] | 953241 | http://www.ibdgenetics.org/downloads.html |
| fasting glucose [43] | 2628879 | http://www.magicinvestigators.org/downloads/ |
| total cholesterol [41] | 2693413 | http://www.sph.umich.edu/csg/abecasis/public/lipids2010/ |
| low density lipoprotein [41] | 2692564 | http://www.sph.umich.edu/csg/abecasis/public/lipids2010/ |
| triglycerides [41] | 2692560 | http://www.sph.umich.edu/csg/abecasis/public/lipids2010/ |
| high density lipoprotein [41] | 2692429 | http://www.sph.umich.edu/csg/abecasis/public/lipids2010/ |
| coronary artery disease [42] | 2420360 | http://www.cardiogramplusc4d.org/downloads/ |
| college [31] | 2321510 | http://ssgac.org/Data.php |
| diastolic blood pressure [34] | 2461325 | http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000585.v1.p1 |
| systolic blood pressure [34] | 2461325 | http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000585.v1.p1 |
| eduyears [31] | 2310087 | http://ssgac.org/Data.php |
| fasting insulin [43] | 2627848 | http://www.magicinvestigators.org/downloads/ |
| multiple sclerosis [45] | 327094 | http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/analysis.cgi?study_id=phs000139.v1.p1&phv=65549&phd=1061&pha=2854&pht=621&phvf=&phdf=&phaf=&phtf=&dssp=1&consent=&temp=1 |
| parkinson [51] | 453217 | http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/analysis.cgi?study_id=phs000089.v3.p2&phv=24040&phd=392&pha=2868&pht=178&phvf=&phdf=&phaf=&phtf=&dssp=1&consent=&temp=1 |

Table 2: 32 GWASs data (continued).


| Name | # of SNPs | Link |
|---|---|---|
| attention-deficit/hyperactivity disorder [13] | 1219805 | http://www.med.unc.edu/pgc/downloads |
| autism spectrum disorder [13] | 1219805 | http://www.med.unc.edu/pgc/downloads |
| bipolar disorder [13] | 1219805 | http://www.med.unc.edu/pgc/downloads |
| major depressive disorder [13] | 1219805 | http://www.med.unc.edu/pgc/downloads |
| schizophrenia [13] | 1219805 | http://www.med.unc.edu/pgc/downloads |
| psoriasis [46] | 440153 | http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/analysis.cgi?study_id=phs000019.v1.p1&phv=20012&phd=179&pha=2855&pht=63&phvf=&phdf=&phaf=&phtf=&dssp=1&consent=&temp=1 |
| rheumatoid arthritis [47] | 2556271 | http://www.broadinstitute.org/ftp/pub/rheumatoid_arthritis/Stahl_etal_2010NG/ |
| type 1 diabetes [48] | 503181 | http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/analysis.cgi?study_id=phs000180.v2.p2&phv=73462&phd=1548&pha=2862&pht=789&phvf=&phdf=&phaf=&phtf=&dssp=1&consent=&temp=1 |
| type 2 diabetes [2] | 2473441 | http://diagram-consortium.org/downloads.html |
| TagCPD [40] | 2459118 | http://www.med.unc.edu/pgc/downloads |
| TagEVRSMK [40] | 2455846 | http://www.med.unc.edu/pgc/downloads |
| TagFORMER [40] | 2456554 | http://www.med.unc.edu/pgc/downloads |
| TagLOGONSET [40] | 2457545 | http://www.med.unc.edu/pgc/downloads |
| ulcerative colitis [49] | 1428749 | http://www.ibdgenetics.org/ |
| waist-hip ratio adjusted for BMI [39] | 2483326 | http://www.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files |
| systemic lupus erythematosus [50] | 258402 | http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/analysis.cgi?study_id=phs000122.v1.p1&phv=66336&phd=&pha=2848&pht=629&phvf=&phdf=&phaf=&phtf=&dssp=1&consent=&temp=1 |

Table 3: Results of four methods when simulations are generated from Pattern 1 with different SNRs.



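| SNR | Metric | sparseBC | SSVD | LAS | Low-rank |
|---|---|---|---|---|---|
| 2.5 | Precision | 0.95 | 0.95 | 1 | 0.84 |
| 2.5 | Recall | 0.95 | 0.99 | 0.19 | 0.82 |
| 2.5 | F1-score | 0.95 | 0.96 | 0.32 | 0.83 |
| 2.1 | Precision | 0.88 | 0.94 | 1 | 0.75 |
| 2.1 | Recall | 0.96 | 0.94 | 0.17 | 0.82 |
| 2.1 | F1-score | 0.92 | 0.94 | 0.29 | 0.78 |
| 1.7 | Precision | 0.71 | 0.92 | 1 | 0.61 |
| 1.7 | Recall | 0.70 | 0.76 | 0.15 | 0.82 |
| 1.7 | F1-score | 0.70 | 0.82 | 0.26 | 0.70 |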


Table 4: Results of four methods when simulations are generated from Pattern 2 with different SNRs.

| SNR | Metric | sparseBC | SSVD | LAS | Low-rank |
|---|---|---|---|---|---|
| 3.3 | Precision | 0.72 | 0.80 | 0.98 | 0.85 |
| 3.3 | Recall | 0.68 | 0.80 | 0.27 | 0.85 |
| 3.3 | F1-score | 0.69 | 0.80 | 0.42 | 0.85 |
| 2.8 | Precision | 0.69 | 0.87 | 0.97 | 0.75 |
| 2.8 | Recall | 0.57 | 0.72 | 0.22 | 0.86 |
| 2.8 | F1-score | 0.61 | 0.79 | 0.36 | 0.80 |
| 2.2 | Precision | 0.72 | 0.80 | 1 | 0.69 |
| 2.2 | Recall | 0.46 | 0.63 | 0.18 | 0.74 |
| 2.2 | F1-score | 0.56 | 0.71 | 0.30 | 0.71 |


Table 5: Results of four methods when simulations are generated from Pattern 3 with different SNRs.

| SNR | Metric | sparseBC | SSVD | LAS | Low-rank |
|---|---|---|---|---|---|
| 2.6 | Precision | 0.99 | 0.74 | 1 | 0.84 |
| 2.6 | Recall | 0.61 | 0.79 | 0.29 | 0.86 |
| 2.6 | F1-score | 0.76 | 0.77 | 0.45 | 0.85 |
| 2.2 | Precision | 0.85 | 0.74 | 1 | 0.75 |
| 2.2 | Recall | 0.63 | 0.79 | 0.22 | 0.84 |
| 2.2 | F1-score | 0.73 | 0.76 | 0.37 | 0.79 |
| 1.8 | Precision | 0.87 | 0.78 | 1 | 0.77 |
| 1.8 | Recall | 0.56 | 0.69 | 0.22 | 0.75 |
| 1.8 | F1-score | 0.68 | 0.73 | 0.37 | 0.76 |


Table 6: Results of four methods when simulations are generated from Pattern 4 with different SNRs.

| SNR | Metric | sparseBC | SSVD | LAS | Low-rank |
|---|---|---|---|---|---|
| 2.9 | Precision | 0.86 | 0.76 | 0.99 | 0.80 |
| 2.9 | Recall | 0.64 | 0.68 | 0.27 | 0.83 |
| 2.9 | F1-score | 0.72 | 0.72 | 0.43 | 0.82 |
| 2.4 | Precision | 0.81 | 0.85 | 1 | 0.78 |
| 2.4 | Recall | 0.61 | 0.48 | 0.20 | 0.76 |
| 2.4 | F1-score | 0.69 | 0.62 | 0.23 | 0.77 |
| 1.9 | Precision | 0.75 | 0.86 | 1 | 0.71 |
| 1.9 | Recall | 0.50 | 0.48 | 0.19 | 0.71 |
| 1.9 | F1-score | 0.59 | 0.62 | 0.32 | 0.71 |